(57) Abstract: Method for identifying peptides and proteins, starting
from the corresponding tandem spectrometry data. More specifically,
the method comprises performing tandem mass spectrometry on a sample
containing one or more protein or peptide, reducing each resulting
spectrum to a peak list, listing possible interpretations for said
peak list into an interpreted peak list taking into account
physico-chemical knowledge, structuring said interpreted peak list
into a structured representation taking into account biological
knowledge, matching said structured representation with a biological
sequence database, and determining the best peptide match or matches
within said database.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to the field of proteomics and particularly to
methods and systems for identifying peptides and proteins starting from
tandem spectrometry data (MS/MS data) obtained experimentally. More
specifically, the method comprises interpreting and structuring MS/MS
data in a way allowing full exploitation of the information contained
in it during matching of the structured data with a biological sequence
database.

The following references are either cited in the text or relevant to
the prior art:

Bafna V. and Edwards N. (2001). SCOPE: a probabilistic model for
scoring tandem mass spectra against a peptide database.
Bioinformatics Suppl 1, 13-21.

Bairoch,A. and Apweiler,R. (2000). The SWISS-PROT protein sequence
database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28,
45-48.

Barker,W.C., Garavelli,J.S., Huang,H., McGarvey,P.B., Orcutt,B.C.,
Srinivasarao,G.Y., Xiao,C., Yeh,L.S., Ledley,R.S., Janda,J.F.,
Pfeiffer,F., Mewes,H.W., Tsugita,A., and Wu,C. (2000). The protein
information resource (PIR). Nucleic Acids Res. 28, 41-44.

Bartels C. (1990). Fast algorithm for peptide sequencing by mass
spectrometry. Biomed. Environ. Mass Spectrom. 19, 363-368.

Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A.,
and Wheeler,D.L. (2002). GenBank. Nucleic Acids Res. 30, 17-20.

Bonabeau E., Dorigo M., and Theraulaz G. (1999). Swarm Intelligence:
From Natural to Artificial Systems. (Oxford University Press).

Chen,T., Kao,M.Y., Tepel,M., Rush,J., and Church,G.M. (2001). A
dynamic programming approach to de novo peptide sequencing via
tandem mass spectrometry. J. Comput. Biol. 8, 325-337.

Clauser K.R., Hall S.G., Smith D.M., Webb J.W., Andrews L.E., Tran
H.M., Epstein L.B., and Burlingame A.L. (1995). Rapid mass
spectrometric peptide sequencing and mass matching for
characterization of human melanoma proteins isolated by two-
dimensional PAGE. Proc Natl Acad Sci USA 92(11), 5072-5076.

Dancik,V., Addona,T.A., Clauser,K.R., Vath,J.E., and Pevzner,P.A.
(1999). De novo peptide sequencing via tandem mass spectrometry. J.
Comput. Biol. 6, 327-342.

Dorigo, M. and Di Caro,G. (1999). The Ant Colony Optimization Meta- 
Heuristic. In New Ideas in Optimization, D.M.G.F.E.Corne D., ed. 

Edman,P. (1970). Sequence determination. Mol. Biol. Biochem.
Biophys. 8, 211-255.

Eng,J.K., McCormack,A.L., and Yates,J.R., III (1994). An approach to
correlate tandem mass spectral data of peptides with amino acid
sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976-
989.

Fenyo,D., Qin,J., and Chait,B.T. (1998). Protein identification
using mass spectrometric information. Electrophoresis 19, 998-1005.

Fernandez-de-Cossio,J., Gonzalez,J., and Besada,V. (1995). A
computer program to aid the sequencing of peptides in collision-
activated decomposition experiments. Comput. Appl. Biosci. 11, 427-
434.

Fernandez-de-Cossio,J., Gonzalez,J., Betancourt,L., Besada,V.,
Padron,G., Shimonishi,Y., and Takao,T. (1998). Automated
interpretation of high-energy collision-induced dissociation spectra
of singly protonated peptides by 'SeqMS', a software aid for de novo
sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom.
12, 1867-1878.

Fernandez-de-Cossio,J., Gonzalez,J., Satomi,Y., Shima,T.,
Okumura,N., Besada,V., Betancourt,L., Padron,G., Shimonishi,Y., and
Takao,T. (2000). Automated interpretation of low-energy collision-
induced dissociation spectra by SeqMS, a software aid for de novo
sequencing by tandem mass spectrometry. Electrophoresis 21, 1694-
1699.

Gatlin,C.L., Eng,J.K., Cross,S.T., Detter,J.C., and Yates,J.R., III
(2000). Automated identification of amino acid sequence variations
in proteins by HPLC/microspray tandem mass spectrometry. Anal. Chem.
72, 757-763.

Gonnet G.H. A tutorial Introduction to Computational Biochemistry
Using Darwin. 1992. E.T.H. Zurich, Switzerland.
Ref Type: Report

Gras,R., Muller,M., Gasteiger,E., Gay,S., Binz,P.A., Bienvenut,W.,
Hoogland,C., Sanchez,J.C., Bairoch,A., Hochstrasser,D.F., and
Appel,R.D. (1999). Improving protein identification from peptide
mass fingerprinting through a parameterized multi-level scoring
algorithm and an optimized peak detection. Electrophoresis 20, 3535-
3550.

Gras R., Gasteiger E., Chopard B., Muller M., and Appel R.D. New
learning method to improve protein identification from peptide
mass fingerprinting. 2000. 4th Siena 2D electrophoresis meeting.
Ref Type: Conference Proceeding

Gras R. and Muller M. (2001). Computational aspects of protein
identification by mass spectrometry. Current Opinion in Molecular
Therapeutics 3, 526-532.

Hines W.M., Falick A.M., Burlingame A.L., and Gibson B.W. (1992).
Pattern-based algorithm for peptide sequencing from tandem mass
spectra of peptides. J. American Society for Mass Spectrometry 3,
326-336.

Ishikawa,K. and Niwa,Y. (1986). Computer-aided peptide sequencing by
fast atom bombardment mass spectrometry. Biomed. Environ. Mass
Spectrom 13, 373-380.

Johnson,R.S. and Biemann,K. (1989). Computer program (SEQPEP) to aid
in the interpretation of high-energy collision tandem mass spectra
of peptides. Biomed. Environ. Mass Spectrom 18, 945-957.

Johnson,R.S. and Taylor,J.A. (2000). Searching sequence databases
via de novo peptide sequencing by tandem mass spectrometry. Methods
Mol. Biol. 146, 41-61.

Kennedy J. and Eberhart R.C. (2001). Swarm Intelligence. (Morgan
Kaufmann).

Mann,M., Hojrup,P., and Roepstorff,P. (1993). Use of mass
spectrometric molecular weight information to identify proteins in
sequence databases. Biol. Mass Spectrom 22, 338-345.

Mann,M. and Wilm,M. (1994). Error-tolerant identification of
peptides in sequence databases by peptide sequence tags. Anal. Chem.
66, 4390-4399.

Pappin D.D.J., Hojrup P., and Bleasby A.J. (1993). Rapid
identification of proteins by peptide-mass fingerprinting. Curr
Biol 3, 327-332.

Perkins D.N., Pappin D.D.J., Creasy D.M., and Cottrell J.S. (1999).
Probability-based protein identification by searching sequence
databases using mass spectrometry data. Electrophoresis 20, 3551-
3567.

Pevzner,P.A., Dancik,V., and Tang,C.L. (2000). Mutation-tolerant
protein identification by mass spectrometry. J. Comput. Biol. 7,
777-787.

Pevzner,P.A., Mulyukov,Z., Dancik,V., and Tang,C.L. (2001).
Efficiency of database search for identification of mutated and
modified proteins via mass spectrometry. Genome Res. 11, 290-299.

Sakurai T., Matsuo T., Matsuda H., and Katakuse I. (1984). Paas 3:
computer program to determine probable sequence of peptides from
mass spectrometric data. Biomed. Mass Spectrom. 11(8), 396-399.

Siegel,M.M. and Bauman,N. (1988). An efficient algorithm for
sequencing peptides using fast atom bombardment mass spectral data.
Biomed. Environ. Mass Spectrom. 15, 333-343.

Stoesser,G., Baker,W., van den,B.A., Camon,E., Garcia-Pastor,M.,
Kanz,C., Kulikova,T., Leinonen,R., Lin,Q., Lombard,V., Lopez,R.,
Redaschi,N., Stoehr,P., Tuli,M.A., Tzouvara,K., and Vaughan,R.
(2002). The EMBL Nucleotide Sequence Database. Nucleic Acids Res.
30, 21-26.

Tateno,Y., Imanishi,T., Miyazaki,S., Fukami-Kobayashi,K., Saitou,N.,
Sugawara,H., and Gojobori,T. (2002). DNA Data Bank of Japan (DDBJ)
for genome scale research in life science. Nucleic Acids Res. 30,
27-30.

Taylor,J.A. and Johnson,R.S. (1997). Sequence database searches via
de novo peptide sequencing by tandem mass spectrometry. Rapid
Commun. Mass Spectrom. 11, 1067-1075.

Taylor,J.A. and Johnson,R.S. (2001). Implementation and uses of
automated de novo peptide sequencing by tandem mass spectrometry.
Anal. Chem. 73, 2594-2604.

Wilkins M.R., Gasteiger E., Bairoch A., Sanchez J.C., Williams K.L.,
Appel R.D., and Hochstrasser D.F. (1999a). Protein identification
and analysis tools in ExPASy server. Methods Mol Biol 112, 531-552.

Wilkins M.R., Gasteiger E., Wheeler C.H., Lindskog I., Sanchez J.C.,
Bairoch A., Appel R.D., Dunn M.J., and Hochstrasser D.F. (1999b).
Multiple parameter cross-species protein identification using
Multident - a world-wide web accessible tool. Electrophoresis 19,
3199-3206.

Yates,J.R., III, Eng J.K., and McCormack A.L. (1995). Mining genomes:
correlating tandem mass spectra of modified and unmodified peptides
to sequences in nucleotide databases. Anal. Chem. 67(18), 3202-3210.

Yates,J.R., III, Eng J.K., Clauser K., and Burlingame A.L. (1996).
Search of Sequence Databases with Uninterpreted High-Energy
Collision-Induced Dissociation Spectra of Peptides. J. American
Society for Mass Spectrometry 7, 1089-1098.

Zhang,W. and Chait,B.T. (2000). ProFound: an expert system for
protein identification using mass spectrometric peptide mapping
information. Anal. Chem. 72, 2482-2489.



2. Description of the Prior Art

Proteomics is the study of the proteins resulting from the expression
of the genes contained in genomes. Due to important variations of
protein expression between cells having the same genome, there are many
proteomes for each corresponding genome. As a result, huge amounts of
[...] complex than the study of the genome.

A typical goal of proteomics is to identify the protein expression in a
given tissue or cell under given conditions. An additional goal of
proteomics is to compare the protein expression in the same tissue,
cell or physiological fluid under varying conditions (for example,
disease vs control), and identify the proteins that are differently
expressed.

In recent years, proteomics research has gained importance due to
increasingly powerful techniques in protein purification/separation,
mass spectrometry and identification techniques, as well as the
development of extensive protein and nucleic databases from various
organisms.

A traditional method for analyzing proteomes involves separation by 1-D
and 2-D polyacrylamide-gel electrophoresis. The 1-D gel method is
generally used to achieve a crude separation of cell lysates where the
most abundant proteins can be separated and detected. 2-D gel
electrophoresis is a more powerful method capable of separating out
hundreds of protein spots, where the spot pattern is characteristic of
protein expression. Typical separation criteria by gel electrophoresis
include electrical charge (isoelectric point - pI) and molecular
weight. Gel electrophoresis methods (1-D and 2-D) nevertheless have
certain fundamental limitations for screening and identification of
proteins. Notably, gel electrophoresis separations are slow and have a
limited resolution (i.e. can only distinguish between a limited number
of proteins (spots)). In recent years, automation has made it possible
to manage larger quantities of data resulting from 2-D gel
electrophoresis, as exemplified by US Pat. No. 5,993,627, US Pat. No.
6,277,259, and WO 00/55636.



Higher resolution can be attained by other chromatography separation
methods such as capillary electrophoresis, gas chromatography, micro-
channel networks, liquid chromatography and high-pressure liquid
chromatography (HPLC), used in complement to gel electrophoresis or
alone. These methods allow the separation of greater numbers of
proteins, even in hard conditions (low sample quantities, small
molecular weight, highly basic or hydrophobic proteins...). Separation
criteria include electrical charge and molecular weight as in gel
electrophoresis, as well as hydrophobicity and other physico-chemical
criteria.

After separation, the proteins must be identified, by sequencing or
other means. Determining the sequence of amino acid residues in a
protein was traditionally accomplished by means of N-terminal Edman
degradation (Edman, 1970). Edman sequencing unfortunately requires
large quantities of a protein (in the order of 10-100 pmols), which
exceed the quantities obtained from most current separation techniques.
In practice, Edman sequencing is possible only after 1-D or 2-D gel
electrophoresis, and then only for the most abundant protein species
found.

Today, most large-scale protein identification procedures use mass
spectrometry (MS) data as a starting point rather than Edman
degradation. Mass spectrometry accurately determines the molecular mass
of the analyzed protein. Additional information can be obtained by
cleavage of the protein into smaller peptides before performing the
mass spectrometry. Cleavage of proteins is usually done by enzymatic
means, most commonly by trypsin, which cleaves specifically at the
C-terminal side of arginine or lysine.

There are several identification methods from mass spectrometry data
(Gras and Muller, 2001). The most widely used method consists in
measuring masses of peptides resulting from the digestion process by
mass spectrometry. The resulting MS spectrum represents a peptide mass
fingerprint (PMF), which is characteristic for each protein.
Identification by peptide mass fingerprint requires a pre-existing
protein database, either directly produced or derived from a nucleic
database. Identification is done by comparing the experimental
masses/spectra obtained by MS (PMF) and the theoretical masses/spectra
of virtually digested protein sequences present in the database. The
masses shared between the experimental and theoretical spectra are used
in a more or less elaborate scoring function to identify the protein.
Some tools only count the number of matches, such as PepSea (Mann et
al., 1993), PeptideSearch (Mann and Wilm, 1994), PeptIdent/MultiIdent
(Wilkins et al., 1999a; Wilkins et al., 1999b), while others use a
probabilistic and/or statistical approach, such as MassSearch (Gonnet,
1992), MOWSE (Pappin et al., 1993), MS-Fit (Clauser et al., 1995),
Mascot (Perkins et al., 1999), ProFound (Zhang and Chait, 2000).
Finally, the algorithm developed by Gras, SmartIdent (Gras et al.,
1999; Gras et al., 2000), uses a machine learning approach.
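
As an illustrative sketch only (it is not the algorithm of any tool
cited above), the match-counting form of peptide mass fingerprinting
can be written in a few lines of Python. The function names, the
two-protein database, the 0.2 Da tolerance and the reduced
residue-mass table are assumptions made for the example; the digestion
rule ignores missed cleavages and the usual proline exception.

```python
import re

# Monoisotopic residue masses in Da (reduced table, enough for the example).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
                "L": 113.08406, "K": 128.09496, "R": 156.10111}
WATER = 18.01056

def tryptic_digest(sequence):
    """Cut after every K or R (simplified trypsin rule, no missed cleavages)."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)

def fingerprint(sequence):
    """Theoretical peptide masses of a virtually digested protein."""
    return [sum(RESIDUE_MASS[aa] for aa in pep) + WATER
            for pep in tryptic_digest(sequence)]

def count_matches(experimental, theoretical, tol=0.2):
    """Number of experimental masses explained by a theoretical mass."""
    return sum(1 for e in experimental
               if any(abs(e - t) <= tol for t in theoretical))

def identify(experimental, database, tol=0.2):
    """Score every database protein and return the best name and all scores."""
    scores = {name: count_matches(experimental, fingerprint(seq), tol)
              for name, seq in database.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy database and measured peptide masses.
DATABASE = {"protA": "GASKVLR", "protB": "GGKAAR"}
experimental = [361.25, 386.20]
best, scores = identify(experimental, DATABASE)
```

The probabilistic tools named above replace the plain count with a
more elaborate scoring function; only the counting step is shown here.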

Unfortunately, the PMF method may not always succeed in giving a
reliable identification, for example when the concentration of the
protein of interest is low, when only a few peptides are found after
the digestion process or when the protein of interest is insufficiently
purified. In addition, post-translational modifications (PTMs) or
polymorphisms may modify the peptide masses and impair proper matching.
Finally, it is possible that the protein of interest is simply not
present in the protein database, and therefore cannot be matched.

In cases where identification is uncertain, one can use tandem mass
spectrometry (MS/MS). MS/MS spectra are obtained after selection of a
peptide coming from the digestion process of the protein of interest,
subsequent fragmentation of said peptide (for example, by collision
with a rare gas), and measurement of the produced fragment masses.
Ideally, fragmentation occurs between every amino acid of the peptide,
and the masses of two adjacent ionic peaks differ by the mass of one
amino acid. In addition to a PMF similar to the one obtained from MS
identification, MS/MS data provide information concerning the peptide
sequence and allow a more detailed interpretation level than MS spectra
alone.
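
The ideal case described above, where adjacent fragment peaks differ
by exactly one residue mass, can be illustrated with a short Python
sketch. The singly charged b-ion series, the reduced residue-mass
table and the function names are assumptions chosen for the example;
real spectra mix several ion types and usually miss some peaks.

```python
# Monoisotopic residue masses in Da (reduced table for the example).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "K": 128.09496}
PROTON = 1.00728

def b_ion_ladder(peptide):
    """m/z of the singly charged N-terminal (b) fragments of a peptide."""
    ladder, total = [], PROTON
    for aa in peptide[:-1]:          # only cleavages inside the peptide
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder

def read_ladder(ladder, tol=0.01):
    """Translate differences of adjacent peaks back into residues."""
    residues = []
    for low, high in zip(ladder, ladder[1:]):
        hits = [aa for aa, mass in RESIDUE_MASS.items()
                if abs((high - low) - mass) <= tol]
        residues.append(hits[0] if hits else "?")   # "?" marks an unexplained gap
    return "".join(residues)

ladder = b_ion_ladder("GASVK")
tag = read_ladder(ladder)
```

Because the first residue is only visible as the absolute mass of the
first peak, the differences recover the inner part of the sequence
(here the residues following the initial glycine).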

Exploiting the information contained in MS/MS spectra is difficult due
to various factors. Notably, the fragmentation process is hardly
foreseeable and depends, among other things, on the amount of energy
used by the mass spectrometer, on the number and the distribution of
the charges carried by the ionic fragment, on its sequence, etc.

Two main identification strategies have been devised to exploit MS/MS 
data: de novo sequencing followed by sequence matching, and direct 
spectrum matching with theoretical spectra from an existing database. 

De novo sequencing consists in deriving a peptide sequence from its
MS/MS spectrum without use of any information extracted from a pre-
existing protein or nucleic database. To do so, de novo sequencing uses
not only the mass values represented by peaks in the mass spectra, but
also their position respective to each other. Early methods, such as
PAAS3 (Sakurai et al., 1984), required generating all possible
sequences whose masses are similar to the spectrum's parent mass and
all the corresponding virtual spectra. The experimental spectrum was
then compared and matched with the virtual spectra. This approach was
rapidly abandoned due to the combinatorial explosion it implies.
Another strategy was to make successive possible extensions of
sequences (Ishikawa and Niwa, 1986). The sequences are built by
successive extension with one or more amino acids. For each iteration,
the sub-sequences and the corresponding virtual spectra are compared
with the experimental spectrum, and the most divergent sequences are
eliminated. Still another, more sophisticated strategy uses the
information lying in the succession of the peaks to make the sequence
extensions (Siegel and Bauman, 1988), SEQPEP (Johnson and Biemann,
1989). In this approach, the peptide sequence is built step by step,
from the mass differences of "neighbor" peaks in the spectrum. This
method can be viewed as the precursor of methods based on graph
representation (Bartels, 1990), (Hines et al., 1992), SeqMS
(Fernandez-de-Cossio et al., 1995; Fernandez-de-Cossio et al., 1998;
Fernandez-de-Cossio et al., 2000), Lutefisk97 (Taylor and Johnson,
1997; Johnson and Taylor, 2000; Taylor and Johnson, 2001), SHERENGA
(Dancik et al., 1999), (Chen et al., 2001). The vertices in the graph
are built from the peaks of the spectrum and represent masses of
potential fragments. Physico-chemical properties are taken into
account to associate a score to each vertex. Whenever two vertices
differ by the mass of one or several amino acids, they are connected
by an arc. Therefore, each path in the graph represents a possible
sequence that can be built from the spectrum. Special algorithms then
search the graph for the best paths (i.e. having the highest score
built from the scores of the vertices belonging to the path), making
it possible to determine the most probable sequence or sequences
corresponding to the experimental spectrum. Accordingly, de novo
sequencing results in one or a limited number of possible amino acid
sequences, obtained without any recourse to a protein or nucleic
database.
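
The graph-based search can be sketched as follows in Python. This is a
minimal illustration and not the algorithm of any cited program: the
vertex scores are invented, arcs are limited to a single amino acid,
the residue-mass table is reduced, and the path score is assumed to be
the plain sum of positive vertex scores. The function name `best_path`
is hypothetical.

```python
# Monoisotopic residue masses in Da (reduced table for the example).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "K": 128.09496}

def best_path(vertices, tol=0.02):
    """vertices: iterable of (mass, score) with positive scores.

    Returns (sequence, path score) of the highest-scoring path, where two
    vertices are linked when their mass difference matches one residue.
    """
    vs = sorted(vertices)
    if not vs:
        return "", 0.0
    best = [score for _, score in vs]    # best score of a path ending at each vertex
    prev = [None] * len(vs)              # predecessor on that best path
    label = [""] * len(vs)               # residue carried by the incoming arc
    for j, (mass_j, score_j) in enumerate(vs):
        for i in range(j):               # arcs only go from lower to higher mass
            diff = mass_j - vs[i][0]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol and best[i] + score_j > best[j]:
                    best[j], prev[j], label[j] = best[i] + score_j, i, aa
    end = max(range(len(vs)), key=lambda k: best[k])
    total, letters = best[end], []
    while prev[end] is not None:         # walk back along the stored predecessors
        letters.append(label[end])
        end = prev[end]
    return "".join(reversed(letters)), total

# Hypothetical vertices: four fragment masses on one ladder plus a noise peak.
vertices = [(0.0, 1.0), (57.02, 2.0), (128.06, 3.0), (200.0, 0.5), (227.13, 2.0)]
sequence, score = best_path(vertices)
```

Since every arc points towards a higher mass, the graph is acyclic and
one pass over the vertices in increasing mass order is sufficient.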

For identification purposes, the sequence(s) (partial or complete)
obtained de novo are then used to scan a protein database with
standard alignment software. De novo sequencing is a fairly complex
task which requires both good quality spectra and manual verification
by a mass spectrometry expert. Accordingly, this approach is not
adapted to the huge amounts of data generated by the high-throughput
settings available today.

The alternative to de novo sequencing is to match the experimental
peptide spectra obtained from MS/MS with theoretical spectra derived
from pre-existing protein databases. Unlike de novo sequencing, most
MS/MS spectra matching tools use only the mass values in the MS/MS
spectra - to the exclusion of their respective positions. The method
most used today for MS/MS identification is the shared peak count
(SPC). The ionic masses of the MS/MS spectrum represent an "ion mass
fingerprint", by analogy with the "peptide mass fingerprint". The
experimental MS/MS spectrum is compared with theoretical ion mass
fingerprints of virtually digested and fragmented proteins in the
database. Their similarity is determined by a combination of
independent scores of correlations between the experimental and
theoretical common masses.
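
A bare shared peak count can be sketched in Python as below. The
sketch is an assumption-laden illustration, not the scoring of any
cited tool: it considers only singly charged b and y ions, uses a
reduced residue-mass table and a 0.5 Da tolerance, and simulates the
experimental spectrum by shifting the theoretical fragments of a
hypothetical peptide by 0.1 Da.

```python
# Monoisotopic masses in Da (reduced residue table for the example).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "L": 113.08406, "K": 128.09496}
WATER, PROTON = 18.01056, 1.00728

def fragment_masses(peptide):
    """Theoretical ion mass fingerprint: singly charged b and y ions."""
    res = [RESIDUE_MASS[aa] for aa in peptide]
    frags = []
    for i in range(1, len(res)):
        frags.append(sum(res[:i]) + PROTON)            # b ion (N-terminal part)
        frags.append(sum(res[i:]) + WATER + PROTON)    # y ion (C-terminal part)
    return frags

def shared_peak_count(peaks, peptide, tol=0.5):
    """Number of theoretical fragments found among the experimental peaks."""
    return sum(1 for f in fragment_masses(peptide)
               if any(abs(f - p) <= tol for p in peaks))

# Simulated spectrum of the hypothetical peptide GASK, badly calibrated by 0.1 Da.
peaks = [m + 0.1 for m in fragment_masses("GASK")]
candidates = ["GSAK", "VLK", "GASK"]
ranking = sorted(candidates, key=lambda p: shared_peak_count(peaks, p),
                 reverse=True)
```

In this toy case the permuted candidate GSAK still shares four of its
six fragments with the spectrum, because each peak is counted without
regard to its neighbours.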

Various SPC algorithms have been developed. All are based on a
probabilistic score depending on the mass errors and differ mainly by
their scoring function, which can be more or less sophisticated. MSTag,
PepFrag (Fenyo et al., 1998), and MASCOT (Perkins et al., 1999) are
examples. One algorithm - SCOPE (Bafna and Edwards, 2001) - uses both a
complex probabilistic model and a dynamic programming method. Another
algorithm, SEQUEST (Eng et al., 1994; Yates et al., 1995; Yates et al.,
1996; Gatlin et al., 2000), uses two filtering levels: SPC followed by
cross-correlation by means of fast Fourier transformation.

Concerning modifications, any mutation or PTM of the source protein is
liable to drastically modify the MS/MS spectra in comparison to the
unmodified protein in the reference database: modified fragment masses
are shifted by a delta corresponding to the mass difference brought by
the modification/mutation. As a result, a modified source peptide might
not find any corresponding match in the reference protein database. SPC
methods generally include in the database all modified/mutated peptides
that they want to consider, which requires prior knowledge of the mass
difference associated with the modifications/mutations taken into
account. Accordingly, modifications whose mass difference with the
unmodified peptide is unpredictable (such as glycosylations) cannot be
taken into account by SPC methods. In addition, including all possible
modifications/mutations of the peptides in the database is unrealistic
due to the combinatorial explosion it implies. As a result, SPC methods
usually take into account only a few very common modifications
occurring on specific amino acids, such as methionine oxidation or
cysteine carbamidomethylation.
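
The combinatorial growth can be made concrete with a small Python
sketch. It assumes that each listed modification is optional at every
site where it can occur, and it enumerates only the resulting shifts
of the peptide mass; the function name and the example peptides are
hypothetical. The two mass shifts are the usual monoisotopic values
for the modifications named above.

```python
from itertools import product

# Mass shift in Da: methionine oxidation and cysteine carbamidomethylation.
MOD_SHIFT = {"M": 15.99491, "C": 57.02146}

def variant_shifts(peptide):
    """Total mass shift of every modified/unmodified variant of a peptide."""
    sites = [aa for aa in peptide if aa in MOD_SHIFT]
    shifts = []
    # Each modifiable site is either left as is (0) or modified (1).
    for states in product((0, 1), repeat=len(sites)):
        shifts.append(sum(MOD_SHIFT[aa] * on for aa, on in zip(sites, states)))
    return shifts

shifts = variant_shifts("MACMK")   # three modifiable sites
```

Every additional modifiable site doubles the number of variants that
would have to be stored in the database, which is why only a few
modifications are considered in practice.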

In addition to the combinatorial problem, SPC algorithms have two other
limitations. First, they consider the peaks independently of each
other, thereby losing some important information contained in MS/MS
spectra. Second, SPC algorithms need to allow a large error tolerance
when used with badly calibrated spectra. As a result, the high
intrinsic accuracy of current mass spectrometers is basically lost.

Two non-SPC methods have been described: spectral convolution and
spectral alignment, with PEDANTA (Pevzner et al., 2000; Pevzner et al.,
2001) as their corresponding tool, which are claimed to be very
efficient in dealing with modifications/mutations, including
unpredictable modifications. Indeed, they have a major advantage over
SPC methods, because they use logical constraints imposed by the
spectrum peak composition to limit the number of considered
modifications/mutations. One obvious trade-off of these approaches is
that one must parse the whole peptide database without using the parent
mass as filtering. In addition, the combinatorial problem grows with
the number of contemplated mass shifts. Accordingly, the number of
modifications/mutations considered must be kept sufficiently low in
order to allow identifications that are sufficiently discriminating.

SUMMARY OF THE INVENTION 

According to the present invention, tandem spectrometry data (MS/MS
data) obtained experimentally from peptide and/or protein-containing
samples is interpreted and structured in a way allowing full
exploitation of the information contained in it during matching of the
structured data with a biological sequence database.



DESCRIPTION OF THE DRAWING

Fig. 1 is a flow chart showing the general pathway of the method for
identifying peptides or proteins from MS/MS data according to an
embodiment of the present invention.

DESCRIPTION OF THE INVENTION 

The present invention concerns a peptide and protein identification
method using MS/MS data, obtained by any standard or non-standard
method of tandem spectrometry, such as, for example, ESI/MALDI Q-TOF
MS, ESI/MALDI Ion-Trap MS, ESI triple quadrupole MS or MALDI TOF-TOF
MS. Instead of directly comparing the experimental MS/MS spectrum with
theoretical sequences from the database as in SPC, the method of the
present invention compares an interpreted and structured view of the
experimental MS/MS spectrum with theoretical sequences.

WO 2004/008371 PCT/IB2002/002731

In the method of the invention and referring to Figure 1, one first
performs tandem spectrometry on a sample 0, containing one or more
protein or peptide. The MS/MS spectrum is then translated into a peak
list 1, listing discrete mass peaks. This step can be performed by
standard mass spectrometry equipment. The resulting peak list 1 is then
interpreted into a list of possible mass explanations (interpreted peak
list 2) taking into account physico-chemical knowledge, notably
concerning the mass spectrometer, fragmentation energy levels and
chemical notions (ion type, charge number, etc.). The interpreted peak
list 2 is then transformed into a structured representation 3, taking
into account biological knowledge - notably amino acid properties -
and preserving at least the following information:

- Mass/charge ratio of the peaks
- Mass/charge ratio of the parent peptide
- Charge of the parent peptide
- Intensity of the peaks
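
One possible way to picture the interpretation step is sketched below
in Python. The text does not specify how interpretations are encoded,
so everything here is an assumption made for illustration: the class
and function names, the restriction to b and y ion types, and the
choice of expressing each hypothesis as the residue mass of the
N-terminal part of the peptide. The sketch keeps the four items listed
above (peak m/z, parent m/z, parent charge, intensity) available to
later steps.

```python
from dataclasses import dataclass

WATER, PROTON = 18.01056, 1.00728   # monoisotopic masses in Da

@dataclass(frozen=True)
class Interpretation:
    peak_mz: float       # measured mass/charge ratio of the peak
    intensity: float     # measured intensity of the peak
    ion_type: str        # hypothesis on the ion type ("b" or "y")
    charge: int          # hypothesis on the fragment charge
    prefix_mass: float   # residue mass of the N-terminal part implied by the hypothesis

def interpret_peaks(peaks, parent_mz, parent_charge):
    """List every (ion type, charge) explanation of every (m/z, intensity) peak."""
    # Sum of the residue masses of the whole peptide.
    parent_residues = (parent_mz - PROTON) * parent_charge - WATER
    interpreted = []
    for mz, intensity in peaks:
        for z in range(1, parent_charge + 1):
            uncharged = (mz - PROTON) * z
            for ion_type in ("b", "y"):
                if ion_type == "b":
                    prefix = uncharged
                else:   # a y ion carries the C-terminal part plus one water
                    prefix = parent_residues - (uncharged - WATER)
                if 0.0 < prefix < parent_residues:   # drop impossible hypotheses
                    interpreted.append(
                        Interpretation(mz, intensity, ion_type, z, prefix))
    return interpreted

# Hypothetical peaks from a singly charged parent ion at m/z 362.2034.
interpreted = interpret_peaks([(129.06585, 10.0), (234.14483, 5.0)],
                              parent_mz=362.2034, parent_charge=1)
```

In this example the b hypothesis for the first peak and the y
hypothesis for the second peak lead to the same N-terminal mass, which
is the kind of agreement a later scoring step could reward.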

Identification of the peptide is performed by matching said structured
representation with a biological sequence database. Said database 4 is
built from any source of biological sequences 5 such as a nucleic
database translated into a protein or peptide database, or any subset
of such databases. A number of sequence libraries can be used,
including for example GenBank (Benson et al., 2002), EMBL (Stoesser et
al., 2002), DDBJ (Tateno et al., 2002), SWISSPROT (Bairoch and
Apweiler, 2000), and PIR (Barker et al., 2000). The matching with the
biological sequence database is performed prior to any reduction of the
structured representation 3 into one or a limited number of amino acid
sequences, in contrast to de novo sequencing. The matching process
leads to a similarity score 8 for each peptide sequence. This score is
then used to determine the best peptide match or matches 9.

The present invention also provides a protein identification method
comprising the steps of the peptide identification method just
described, and comprising a further step consisting in using the
peptide matching information for identification of the corresponding
protein or proteins in a protein database.

In a preferred embodiment of the invention, the structured
representation matched with the database is a graph 3 wherein vertices
6 of the graph 3 represent "ideal" fragments, built from MS/MS peaks
(in the interpreted peak list 2) under an ionic hypothesis. Each vertex
6 representing a fragment indicates among others the molecular mass
value of said fragment, the specific ionic hypothesis (ion type) for
this fragment, and is assigned a score value expressing the credibility
level for the vertex. Two vertices 6 are connected by an edge 7
whenever their mass difference is equivalent to the mass value of one
or more amino acids, depending on the combinatorial level chosen.
Letters representing these specific amino acids are attached to the
edge 7. Accordingly, the graph 3 represents all amino acid tags and
complete sequences that can possibly be built from the MS/MS spectrum.
Identification of the best peptide match or matches 9 is performed
using the similarity scores 8 obtained by comparing theoretical
peptides from the peptide sequence database 4 and the graph 3.
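
The effect of the combinatorial level on the edges can be sketched in
Python as follows. The sketch is only one possible reading of the
paragraph above: vertices are reduced to their masses, the level is
taken to be the maximum number of amino acids an edge may span, the
residue-mass table is reduced, and the names `edge_labels` and
`build_edges` are hypothetical.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da (reduced table for the example).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "K": 128.09496}

def edge_labels(level):
    """(mass, letters) for every combination of 1..level amino acids.

    Letters are listed in alphabetical order: the label "AV" stands for
    either order of the two residues.
    """
    table = []
    for k in range(1, level + 1):
        for combo in combinations_with_replacement(sorted(RESIDUE_MASS), k):
            table.append((sum(RESIDUE_MASS[aa] for aa in combo), "".join(combo)))
    return table

def build_edges(vertex_masses, level=1, tol=0.02):
    """Edges (from index, to index, letters) between mass-sorted vertices."""
    masses = sorted(vertex_masses)
    labels = edge_labels(level)
    edges = []
    for i, low in enumerate(masses):
        for j in range(i + 1, len(masses)):
            diff = masses[j] - low
            for mass, letters in labels:
                if abs(diff - mass) <= tol:
                    edges.append((i, j, letters))
    return edges

# Hypothetical vertices where the fragment between A and V was not observed.
vertices = [0.0, 57.02, 227.13]
edges_level_1 = build_edges(vertices, level=1)
edges_level_2 = build_edges(vertices, level=2)
```

With level 1 the missing fragment breaks the chain after the first
vertex; with level 2 an edge labeled with two letters bridges the gap,
at the cost of a larger label table.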

The method of the present invention compares the structured
representation (or graph) 3 with theoretical peptides from a peptide
sequence database 4. In contrast to identification by de novo
sequencing followed by sequence matching - that uses database
information only after reduction of the graph to one or several
sequences - the present invention directly uses database information
to direct the comparison with the structured representation or graph.
The goal is to find sections (sets of consecutive edges 7) of the
structured representation or graph 3 which best explain the peptide.
Although a section can be viewed as a classical tag encompassing
sequence information, it is more than that as it contains additional
information used in the comparison process.

In the present invention, the structured representation in general, and
the graph structure in particular, have significant advantages over
existing methods. This approach first eliminates the calibration issue
during the comparison process. As already mentioned, peak masses in
MS/MS spectra can be shifted by a significant value in spite of the
high intrinsic accuracy of the spectrometer. As a result, existing
identification methods based on SPC must allow for a high error
tolerance when comparing peak masses and theoretical fragment masses,
which leads to a significant increase of the noise level, hence of the
number of false positives. The method of the present invention compares
differences of peak masses with differences of theoretical masses.
Because differences of adjacent masses are weakly influenced by
calibration errors, the method of the present invention makes it
possible to take full advantage of the spectrometer accuracy. Another
advantage of the structured representation is that it makes it possible
to take into account not only the number of peak matches (as in SPC),
but also the number of successive matches liable to explain the
sequence.
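
The calibration argument can be checked numerically with a short
Python sketch. It assumes the simplest possible miscalibration, a
constant offset d added to every peak, in which case the offset
cancels exactly in a difference: (m2 + d) - (m1 + d) = m2 - m1. Real
calibration errors may vary with mass, which is why the text speaks of
a weak rather than a null influence. The function names, the three
fragment masses and the 0.05 Da tolerance are illustrative choices.

```python
def absolute_matches(peaks, theoretical, tol):
    """Peaks lying within tol of some theoretical mass (SPC-style comparison)."""
    return sum(1 for p in peaks
               if any(abs(p - t) <= tol for t in theoretical))

def difference_matches(peaks, theoretical, tol):
    """Adjacent peak differences lying within tol of the theoretical differences."""
    exp_diffs = [b - a for a, b in zip(peaks, peaks[1:])]
    theo_diffs = [b - a for a, b in zip(theoretical, theoretical[1:])]
    return sum(1 for e, t in zip(exp_diffs, theo_diffs) if abs(e - t) <= tol)

theoretical = [58.029, 129.066, 216.098]    # three consecutive fragment masses
offset = 0.3                                # simulated calibration error in Da
peaks = [t + offset for t in theoretical]

tight = 0.05                                # tolerance close to instrument accuracy
n_absolute = absolute_matches(peaks, theoretical, tight)
n_difference = difference_matches(peaks, theoretical, tight)
```

With the tight tolerance no peak matches in absolute terms, while both
differences of adjacent peaks still match.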

In a preferred embodiment of the invention, the matching of the
structured representation with sequences in the database is performed
by parsing the structured representation or the graph according to each
database sequence, each parsing leading to a score correlating each
database sequence to the structured representation or graph.

This approach notably makes it possible to compare the structured
representation with any sub-sequences of the peptide sequence database,
each parsing leading to a score correlating the sub-sequence with a
section of the structured representation or graph. In case of
incomplete spectral information, non-linked relevant sets of successive
edges (sections) can be combined together to form a same peptide
sequence. In case of modified source peptides, this approach also makes
it possible to combine non-linked relevant sets of successive edges
(sections) according to a modification hypothesis.
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Representations under a graph structure allow to keep all the original 
information, as well as to consider information coming from many 
different sources during the comparison process. The graph includes two 
information types : first, local information, which are used for the 
path building in order to favor most pertinent edges and which are 
stored in variables associated with vertices and edges (as the vertices 
mass, intensity, score or the edge amino acid) , and seconci, global 
information, which describe path pertinence related to the current 
peptide or to any subsequence belonging to it, and possibly stored in 
weights associated with edges. Local and global parameters must be 
weighted and combined in a way maximizing the performance of the 
535 identification algorithm, and allowing sufficient discrimination 
between the peptide ranked first and the other candidates . Using a set 
of identified spectra from a known mass spectrometer, it is possible to 
optimize the weights with genetic algorithms (Gras et al., 2000; Gras 
et al., i999) . 



540 



In another embodiment of the invention, said parsing is performed
through the use of a Swarm Intelligence-type algorithm (Kennedy and
Eberhart, 2001; Bonabeau et al., 1999). Swarm intelligence is a form of
distributed artificial intelligence: self-organization of
unsophisticated units (agents evolving and interacting within a given
environment and able to manage direct and/or indirect communication)
results in the emergence of an intelligent collective behavior.




WO 2004/008371 PCT/IB2002/002731 




In still another embodiment of the invention, the Swarm
Intelligence-type algorithm is an algorithm called "Ant Colony
Optimization" (ACO) (Dorigo and Di Caro, 1999). ACO algorithms are
defined as multi-agent systems inspired by real ant colony behavior.
The principle of ACO is to explore, iteratively and simultaneously,
different solutions of a given problem by an ant-agent population. The
emergent collective behavior is guided by indirect communication
between the ants, mediated by environmental modifications (stigmergy).
Ants modify their environment by depositing given amounts of pheromone,
which are locally accessible and affect the behavior of the other ants.
In this embodiment, an ACO algorithm inspired by the
"trail-laying/trail-following" foraging behavior of ants is used to
score the matching of the current peptide of the database with the
structured representation. Since ants can find the shortest path
connecting the colony to the food source, it is possible to exploit the
rules governing the foraging process and use them to find good scoring
paths in the graph. Each ant obtains a score depending on the quality
of the found solution. The use of virtual pheromone allows good
solutions to be memorized and acts as a positive feedback
(intensification of the search). In order to avoid premature
convergence, a certain amount of pheromone also evaporates at each
iteration (negative feedback, diversification of the search).

The modified ACO used to parse the graph first sets the pheromone
quantity of each edge to a tiny value. Then, the ants parse the graph
iteratively. At each iteration, the ants move on the graph from one
vertex to the other, using existing edges or, if allowed, jumping from
one vertex to the other until a stop criterion is reached (for example,
when arrived on a vertex having no successor). The choice of the next
edge results from a probabilistic computation, taking into account both
local parameters (i.e. the score of the successor vertex) and the
global learning already done (i.e. the amount of pheromone on the
successor edge). At the end of each iteration, some pheromone is
automatically removed from each edge (evaporation), while some
pheromone is added on each edge parsed by an ant (the exact amount
being dependent on the ant's score). As a result, the algorithm allows
gradual convergence toward one or several good scoring sections, which
can be further correlated in order to maximally cover the theoretical
candidate peptide, ultimately leading after analysis of all peptides to
a ranked list of candidate peptides.

The ACO algorithm has several advantages. For example, the stochastic
nature of the ant motion makes it possible to parse any path in the
graph. All possible mutations compatible with the MS/MS spectrum are
implicitly represented in the graph, and possible modifications can be
contemplated by allowing the ants to jump from one vertex to another,
unconnected one. Like spectral alignment methods, the present invention
uses the spectrum logical constraints to limit the number of
combinations of possible modifications. In addition, it drastically
restricts this number by allowing only directed jumps joining relevant
sections of the representation or graph. Thus, only modifications
enhancing the global correspondence between the sequence and the
spectrum are considered. It is also possible to restrict the vertices
allowed for an ant, depending on the vertices already parsed by this
ant. This makes it possible to accept, for example, only one
missed cleavage: an ant having used an edge corresponding to a lysine
could avoid further incorporating a second lysine.

An additional advantage of the present invention is that switching from
it to a more traditional de novo sequencing mode is straightforward, by
simply setting aside the information coming from the database.

The invention also provides a system comprising a computer linked to
one or more mass spectrometers and one or more biological sequence
databases, said computer comprising a program for performing the steps
of the methods described herein.

The invention also provides a computer-readable medium comprising
instructions for causing a computer linked to one or several mass
spectrometers and to one or more biological sequence databases to
perform the steps of the methods described herein.



DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT 

The following paragraphs provide a detailed description of MS/MS data
treatment and identification according to a preferred embodiment of the
invention, combining a graph representation and an ACO algorithm and
called Popitam (Peptide Or Protein Identification from TAndem Mass
spectrometry).
I. Peak interpretation



Let us define S_exp = {s_1, s_2, ..., s_|S_exp|}, the experimental MS/MS
peak list to identify, and a set of ionic hypotheses
Δ = {η_1, η_2, ..., η_|Δ|}. An ionic hypothesis can be seen as a possible
interpretation of a peak. Each η_k has four attributes, which are
presumptions concerning the ionic fragment s_j measured by the
spectrometer: an offset value o(η_k), i.e. the mass difference between
the ionic fragment and the corresponding b-ion type fragment (for
comprehension purposes, we will call such fragments b-fragments, and
their corresponding masses b-masses), a terminus side t(η_k) (N-term or
C-term), a number of charges c(η_k), and an approximated occurrence
probability p(η_k). The probability p(η_k) depends among other things on
the spectrometer used, and can be determined during a learning phase
using a set of identified spectra (Dancik et al., 1999).

The interpretation process consists in attributing to each peak from
S_exp an ionic hypothesis comprising all four attributes described
above, every peak being paired in turn with every hypothesis of Δ.
Therefore, each peak s_j from S_int will be characterized by a
mass/charge ratio μ(s_j), an intensity i(s_j), and an ionic hypothesis
η(s_j). The number of elements in the interpreted peak list S_int is:
|S_int| = |S_exp|·|Δ|.

This approach means that at least |Δ|-1 interpreted peaks computed from
a given peak in S_exp are false.
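
As a minimal sketch of the interpretation step, the pairing of every
peak with every ionic hypothesis can be written as a Cartesian product.
The class and field names are hypothetical, and the offsets and
probabilities below are illustrative placeholders rather than values
taken from the description.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class IonicHypothesis:
    name: str
    offset: float       # o(eta): mass offset relative to the b-type fragment
    terminus: str       # t(eta): "N-term" or "C-term"
    charge: int         # c(eta): number of charges
    probability: float  # p(eta): approximate occurrence probability


@dataclass(frozen=True)
class InterpretedPeak:
    mz: float
    intensity: float
    hypothesis: IonicHypothesis


def interpret(peaks, hypotheses):
    """Pair every experimental peak (mz, intensity) with every hypothesis."""
    return [InterpretedPeak(mz, intensity, h)
            for (mz, intensity), h in product(peaks, hypotheses)]


# Illustrative hypothesis set and peak list (placeholder numbers).
hypotheses = [
    IonicHypothesis("b", 0.0, "N-term", 1, 0.8),
    IonicHypothesis("a", -27.995, "N-term", 1, 0.3),
    IonicHypothesis("y", 18.011, "C-term", 1, 0.8),
]
peaks = [(175.119, 40.0), (288.203, 100.0)]
s_int = interpret(peaks, hypotheses)
```

The resulting list has one entry per (peak, hypothesis) pair, so its
length equals the product of the two input sizes, as stated above.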



II. Graph construction

Let us define a spectrum graph G=(V,E) as a directed acyclic graph,
with a set of vertices V = {v_1, v_2, ..., v_|V|} and of edges
E = {e_ij | i < j ≤ |V|, v_i and v_j ∈ V}. Each vertex v_i is
characterized by a b-mass μ(v_i) and its corresponding ionic peak
mass/charge ratio μ°(v_i), an intensity i°(v_i), a score σ(v_i), an
ionic hypothesis η(v_i), a family F(v_i), and a successor list
succ(v_i), while each edge e_ij ∈ E is characterized by a pheromone
trail τ(e_ij) and a label λ(e_ij).

II. 1) Building the vertices:

G is built from the peak list S_int. The first step is to transform all
interpreted peaks into b-ions charged once, which represent N-terminal
"ideal" fragments.

Each peak from S_int leads to a vertex v_i. Given M_exp the experimental
parent mass, with M_exp = (M_obs - 1)·c(M_obs), M_obs being the
mass/charge ratio of the peptide parent mass, and c(M_obs) its charge
number, we build the vertices according to algorithm 1.

Algorithm 1: Building the vertices
i = 0;
For each s_j ∈ S_int {
    if (t(η(s_j)) = "N-term")
        μ(v_i) ← c(η(s_j))·μ(s_j) - (c(η(s_j)) - 1) - o(η(s_j));
    if (t(η(s_j)) = "C-term")
        μ(v_i) ← M_exp - [c(η(s_j))·μ(s_j) - (c(η(s_j)) - 1) - o(η(s_j))];
    μ°(v_i) ← μ(s_j);
    i°(v_i) ← normalize(i(s_j));
    i++;
}
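
The conversion performed by algorithm 1 can be sketched as follows. This
is an illustration under stated assumptions: the function names are
hypothetical, the proton mass is approximated by 1 as in the formula for
M_exp, and the C-terminal branch assumes that the b-mass is obtained by
subtracting the singly charged, offset-corrected mass from the parent
mass.

```python
def parent_mass(m_obs, charge):
    """M_exp = (M_obs - 1) * c(M_obs), with the proton mass taken as 1."""
    return (m_obs - 1) * charge


def b_mass(mz, charge, offset, terminus, m_exp):
    """Convert an interpreted peak into the b-mass of its vertex."""
    # Bring the peak back to a singly charged mass and remove the ion offset.
    singly_charged = charge * mz - (charge - 1) - offset
    if terminus == "N-term":
        return singly_charged
    if terminus == "C-term":
        # Assumed complement rule for C-terminal ions (see lead-in).
        return m_exp - singly_charged
    raise ValueError(f"unknown terminus: {terminus!r}")


m_exp = parent_mass(501.0, 2)
n_term_vertex = b_mass(250.5, 2, 0.0, "N-term", m_exp)
c_term_vertex = b_mass(400.0, 1, 0.0, "C-term", m_exp)
```

Both interpreted peaks end up on the same b-mass axis, which is what
allows vertices derived from N-terminal and C-terminal hypotheses to be
compared with one another.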

We also create an initial vertex corresponding to the empty sequence
and a final vertex corresponding to the complete sequence. Therefore,
the number of vertices is equal to |S_int| + 2.



II. 2) Vertex families:

For each vertex, a family F of neighbor vertices is defined. The
concept of family is based on the idea that when a b-fragment is
represented by several ionic peaks in S_exp, the computed b-masses
μ(v_i) of these peaks will be almost equal. The family building is hence
based on the vertex b-mass differences, which must be lower than a
specified threshold. We chose not to merge the vertices as described in
(Dancik et al., 1999), because the merging process does not manage the
calibration error on the peaks and depends on the parent mass accuracy,
which is often quite low. Accordingly, two b-masses representing the
same b-fragment and derived from ionic hypotheses of different terminal
types (t(η(v_i)) ≠ t(η(v_j))) can be quite different when compared to
the b-masses obtained from ionic hypotheses of the same terminal type.
Such b-masses therefore cannot be merged because they are too different
or, if merged, can produce a new vertex with a substantially less
accurate b-mass. In order to avoid this problem we do not merge the
vertices, but build vertex families F(v_i) = {v_1, ..., v_|F(v_i)|}
containing all neighbor vertices possibly belonging to the same
b-fragment. This approach keeps the b-mass of the vertices unchanged,
and thereby fully benefits from the accuracy of the spectrometer. In
addition, the algorithm used for building the families is not greedy,
as is the merging algorithm proposed by Dancik, but is exact.

A vertex v_j is added to a family F(v_i) according to the following
rules. First, the two vertex b-masses must be close enough. As shown in
equation 1, the threshold must be adapted depending on whether the two
vertices joined in a same family are derived from ionic hypotheses of
the same terminal type or of different terminal types.

Equation 1: $|\mu(v_j) - \mu(v_i)| < \varepsilon$,
with $\varepsilon = \varepsilon_1$ if $t(\eta(v_i)) = t(\eta(v_j))$, $\varepsilon = \varepsilon_2$ if $t(\eta(v_i)) \neq t(\eta(v_j))$, and $\varepsilon_1 < \varepsilon_2$

Second, the two vertex b-masses have to be issued from different ionic
hypotheses (η(v_i) ≠ η(v_j)).

Algorithm 2: Building the families
For i = 1 to |V| {
    F(v_i) = ∅;
    test1 = TRUE;
    While (test1) {
        v_j ← find the new closest vertex (v_i);
        if (term(v_i) == term(v_j)) ε = ε_1;
        else ε = ε_2;
        if (|μ(v_i) - μ(v_j)| < ε) {
            test2 = TRUE;
            For each v_k ∈ F(v_i)
                if (η(v_k) == η(v_j)) test2 = FALSE;
            if (test2) F(v_i) = F(v_i) ∪ {v_j};
        }
        else test1 = FALSE;
    }
}
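
A compact sketch of the family-building loop is given below. It follows
the structure of algorithm 2 but is only an illustration: the tuple
layout, the function name and the threshold values are assumptions, and
the hypothesis of v_i itself is excluded from its family in order to
honor the second rule stated above.

```python
def build_families(vertices, eps_same, eps_diff):
    """vertices: list of (b_mass, hypothesis, terminus) tuples.

    Returns, for each vertex, the indices of the members of its family.
    """
    families = []
    for i, (mass_i, hyp_i, term_i) in enumerate(vertices):
        # Visit the other vertices from the closest b-mass to the farthest.
        others = sorted((j for j in range(len(vertices)) if j != i),
                        key=lambda j: abs(vertices[j][0] - mass_i))
        family, seen = [], {hyp_i}
        for j in others:
            mass_j, hyp_j, term_j = vertices[j]
            eps = eps_same if term_i == term_j else eps_diff
            if abs(mass_j - mass_i) >= eps:
                break  # the closest remaining vertex is too far: stop
            if hyp_j not in seen:  # one vertex per ionic hypothesis
                family.append(j)
                seen.add(hyp_j)
        families.append(family)
    return families


# Hypothetical vertices: three near 300 Da, one isolated at 450 Da.
vertices = [
    (300.10, "b", "N-term"),
    (300.12, "a", "N-term"),
    (300.40, "y", "C-term"),
    (450.00, "b", "N-term"),
]
families = build_families(vertices, eps_same=0.05, eps_diff=0.5)
```

The b-masses of the vertices are left unchanged; only membership lists
are produced, in line with the choice of not merging vertices.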



II. 3) Scoring the vertices:

Because the vertices are built under some assumptions, we need a value
defining the credibility level of each vertex. This value is
represented by a score σ(v_i), defined according to a non-exhaustive
list of criteria. Two criteria are currently taken into account,
leading to a redundancy score ρ(v_i) and a probability score π(v_i).



Equation 2: $\sigma(v_i) = \rho(v_i) \cdot \pi(v_i)$



Once the families are defined, it is possible to compute ρ(v_i) and
π(v_i). The redundancy score ρ(v_i) must be increased according to the
family size, as several equivalent b-masses confirm the ionic hypothesis
of v_i, while the probability score π(v_i) takes into account the
occurrence probability p(η) of the family members:

Equation 3 : " «^i> = .^,n,^,P(n(vi))-^^^^^^^ 

II. 4) Connecting the graph:

If the b-masses of two associated vertices v_i and v_j differ by the
value of one or several amino acids, they can be connected by an edge
e_ij. According to the number of amino acids included in a given edge,
the latter can be called a simple edge (|λ(e_ij)| = 1), a double edge
(|λ(e_ij)| = 2), and so on. Let A = {a_1, a_2, ..., a_|A|} be the
alphabet of the amino acids. A contains all common amino acids, as well
as some modified amino acids, such as carboxymethylated cysteine,
carbamidomethylated cysteine, or oxidized methionine. Each a_i ∈ A has a
mass μ(a_i) and a label λ(a_i). A^c = {a^c_1, a^c_2, ..., a^c_|A^c|} is
the set of all combinations of 1 to N amino acids among A. Because the
edge number increases exponentially with the value of N, the latter is
usually small (typically N=2 or N=3).

Given μ(a^c_n), the sum of the masses of all amino acids in a^c_n, and
λ(a^c_n), formed from the labels of the amino acids in a^c_n, algorithm
3 shows the computation of the edges. The vertex list must be sorted
according to the b-mass values.

Algorithm 3: Connecting the graph
For i = 0 to |V|
    For j = i + 1 to |V| {
        if (t(η(v_i)) = t(η(v_j))) ε = ε_1;
        else ε = ε_2;
        For n = 1 to |A^c| {
            if (|μ(v_j) - μ(v_i) - μ(a^c_n)| < ε)
                createEdge(e_ij, a^c_n);
        }
    }
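
The edge construction can be sketched as below. The sketch assumes a
four-residue subset of the alphabet (standard monoisotopic residue
masses), N = 2, and a single tolerance instead of the two
terminus-dependent thresholds; the function names and the example
b-masses are hypothetical.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses for a small illustrative subset of A.
AMINO_ACID_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}


def combinations_up_to(alphabet, n_max):
    """All unordered combinations of 1..n_max residues as (label, mass)."""
    combos = []
    for n in range(1, n_max + 1):
        for combo in combinations_with_replacement(sorted(alphabet), n):
            combos.append(("".join(combo), sum(alphabet[a] for a in combo)))
    return combos


def connect(b_masses, combos, eps):
    """b_masses must be sorted; returns edges as (i, j, label) triples."""
    edges = []
    for i in range(len(b_masses)):
        for j in range(i + 1, len(b_masses)):
            gap = b_masses[j] - b_masses[i]
            for label, mass in combos:
                if abs(gap - mass) < eps:
                    edges.append((i, j, label))
    return edges


combos = combinations_up_to(AMINO_ACID_MASSES, 2)
b_masses = [0.0, 57.021, 128.059, 227.127]
edges = connect(b_masses, combos, eps=0.01)
```

Labels of double edges are stored as unordered combinations, so the
order of the two residues has to be resolved later, for instance when
the path is compared with a database sequence.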
III. Identification process




III. 1) The peptide database 



Let D = {P_1, P_2, ..., P_|D|} be the peptide database used for the
identification. The peptides P_c can be obtained from the whole or a
subset of nucleic or protein databases. The P_c are characterized by
three attributes. First, their sequence
Q(P_c) = {a^c_1, a^c_2, ..., a^c_|Q(P_c)|} with a^c_j ∈ A. Second, their
theoretical mass μ(P_c) (see equation 4). Third, an identification
score score(P_c).

Given the terminus mass values μ(N-term) and μ(C-term), μ(P_c) is
obtained as follows:

Equation 4: $\mu(P_c) = \mu(\text{N-term}) + \mu(\text{C-term}) + \sum_{j=1}^{|Q(P_c)|} \mu(a^c_j)$



The identification process consists in comparing the peptides of D with
the graph G and in correlating each peptide P_c ∈ D with a score
score(P_c). Given M_exp, the experimental parent mass of the spectrum,
and r, a predetermined threshold, we have:

Algorithm 4: Identification process
For c = 1 to |D|
    If (|μ(P_c) - M_exp| < r)
        score(P_c) = compare(P_c, G);
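
Equation 4 and algorithm 4 can be sketched together as follows. The
terminus masses (H and OH, monoisotopic), the residue subset, the
function names and the stand-in compare function are assumptions made
for the example; the real compare function is the graph comparison
described in the next section.

```python
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "K": 128.09496}
N_TERM_MASS = 1.00783   # H, assumed N-terminus group
C_TERM_MASS = 17.00274  # OH, assumed C-terminus group


def theoretical_mass(sequence):
    """Equation 4: terminus masses plus the sum of the residue masses."""
    return N_TERM_MASS + C_TERM_MASS + sum(RESIDUE_MASSES[a] for a in sequence)


def identify(database, m_exp, threshold, compare):
    """Score only peptides whose mass is close to m_exp; rank by score."""
    scored = [(compare(peptide), peptide) for peptide in database
              if abs(theoretical_mass(peptide) - m_exp) < threshold]
    return sorted(scored, reverse=True)


def dummy_compare(peptide):
    """Placeholder scoring function standing in for compare(P_c, G)."""
    return 1.0 if peptide == "GAS" else 0.2


database = ["GAS", "SAG", "GK", "AAAA"]
ranked = identify(database, m_exp=233.10, threshold=0.5, compare=dummy_compare)
```

Only the two peptides whose mass lies within the threshold of the
parent mass are passed to the scoring function, which keeps the
expensive graph comparison off the rest of the database.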



This algorithm results in a list of candidate peptides ranked by score.
The following paragraph describes the compare function, which performs
the comparison of a theoretical peptide with the graph.


III. 2) Comparison process 

The comparison process between the graph G and a peptide P_c requires
finding in G the sections best explaining P_c. A complete section is a
path in the graph corresponding to a whole peptide sequence. We present
here a possible non-deterministic strategy to search, for a given P_c,
the best complete section in G. The algorithm will be modified further
in order to extract sections instead of complete paths.

Let F = {f_1, f_2, ..., f_|F|} be the ant population. Each ant f_k,
walking on the graph at iteration t, builds a path which includes a set
of vertices L_V^t(f_k), subset of V, such that

and consequently, a set of edges, denoted L_E^t(f_k) ⊂ E, of size
|L_E^t(f_k)|. The quality of L_E^t(f_k) is represented by the ant's
score S^t(f_k). The concatenation of the edge labels λ(e_ij), with
e_ij ∈ L_E^t(f_k), represents the sequence

L_Q^t(f_k) = {a^k_1, a^k_2, ..., a^k_|L_Q^t(f_k)|}, a^k_j ∈ A^c,

built by ant k.

Algorithm 5 is an adaptation to our problem of an ACO algorithm. First,
τ(e_ij), the amount of pheromone of each edge e_ij ∈ G, is initialized
(to a tiny value τ_0), as well as the best complete path found in the
graph (L*) and its associated score S(L*). At the beginning of each
iteration (t_max is the predefined total number of iterations), the
amount of pheromone that will be added to each edge, Δτ(e_ij), is
initialized at 0. Then, each ant parses the graph, building its own
path L_E^t(f_k), and gets a score S^t(f_k). This score is used for
updating the Δτ(e_ij) for each e_ij ∈ L_E^t(f_k), Q being a predefined
constant value, chosen of the same order of magnitude as that of the
optimal score. Authors have demonstrated that the value of Q has little
influence on the final result (Theiler, 2001; Bonabeau et al., 1999).
If the path built by the ant obtains a higher score than S(L*), L* and
S(L*) are updated. Finally, when all ants have parsed the graph and
have added their contribution to the Δτ(e_ij), the graph is updated,
ω ∈ [0;1[ being the evaporation rate. At the end, the compare function
returns the score of the best path attributed to the peptide P_c.



Algorithm 5: Finding the best path in G for a peptide P_c
Initialization:
    L* = ∅;
    S(L*) = 0;
    For each edge e_ij ∈ E: τ(e_ij) = τ_0;

Iterations:
    For t = 1 to t_max {
        For each e_ij ∈ E: Δτ(e_ij) = 0;
        For k = 1 to |F| {
            (L_V^t(f_k), L_E^t(f_k), L_Q^t(f_k)) = parseGraph(P_c, f_k);
            S^t(f_k) = scoreAnt(P_c, f_k, L_V^t(f_k), L_E^t(f_k), L_Q^t(f_k));
            For each e_ij ∈ L_E^t(f_k): Δτ(e_ij) = Δτ(e_ij) + S^t(f_k)/Q;  // update Δτ(e_ij)
            if (S(L*) < S^t(f_k)) {  // update best path
                L* = L_E^t(f_k);
                S(L*) = S^t(f_k);
            }
        }
        For each e_ij ∈ E: τ(e_ij) ← (1-ω)·τ(e_ij) + Δτ(e_ij);  // update graph
    }
    return S(L*);
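
The pheromone bookkeeping done at the end of each iteration can be
sketched as a single function. The deposit of score/Q per traversed
edge is an assumption made for this example, as are the function name,
the dictionary representation of the edges and the numerical values.

```python
def update_pheromone(tau, ant_paths, evaporation, q):
    """Return the pheromone map after one iteration.

    tau:        dict mapping an edge to its current pheromone amount
    ant_paths:  list of (edges, score) pairs, one per ant
    evaporation: rate omega in [0, 1)
    q:          constant of the same order of magnitude as the best score
    """
    delta = {edge: 0.0 for edge in tau}
    for edges, score in ant_paths:
        for edge in edges:
            delta[edge] += score / q  # assumed deposit rule (positive feedback)
    # Evaporation (negative feedback) followed by the accumulated deposits.
    return {edge: (1.0 - evaporation) * tau[edge] + delta[edge] for edge in tau}


tau = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0}
ant_paths = [
    ([("a", "b"), ("b", "c")], 2.0),  # a well-scoring two-edge path
    ([("a", "c")], 1.0),              # a weaker one-edge path
]
new_tau = update_pheromone(tau, ant_paths, evaporation=0.5, q=2.0)
```

Edges used by the better-scoring ant end up with more pheromone than
the edge used by the weaker ant, so later iterations are biased toward
the better path without excluding the other.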



A more detailed description of the parseGraph and scoreAnt functions
follows:

III. 2a) Parsing the graph:

The ant is first placed on the initial vertex v_1. It can go forward
as long as the current vertex v_i has any successors (succ(v_i) ≠ ∅) and
as long as the length of its built sequence |L_Q^t(f_k)| is smaller than
the length of the current database sequence |Q(P_c)|. The transition
rule used to go from a vertex v_i to a vertex v_j with v_j ∈ succ(v_i)
depends on three pieces of information. The first one is visibility,
represented by σ(v_j), the score of the successor vertex. It can be
considered as a local parameter. The second piece of information
corresponds to the memory of the learning previously done by the ant
population. It is a global parameter, representing the amount of
pheromone laid on the edge e_ij, τ(e_ij). Finally, the third piece of
information is the sequence of the current database peptide P_c. Indeed,
if the label of the next edge e_ij matches the next amino acid in the
sequence Q(P_c), the transition probability is multiplied by a
predefined constant value dependent upon the edge label length.

Given α and β, two adjustable parameters controlling the relative
weight of the learning and the visibility, p_k^t(e_ij), the probability
for ant f_k to take the edge e_ij at iteration t, p_k^t(e_i), the set of
these probabilities for all succ(v_i), and
Q(P_c) = {a^c_1, a^c_2, ..., a^c_|Q(P_c)|}, the current peptide sequence:
Algorithm 6: Parsing G with ant f_k
i = 1;
L_V^t(f_k) = ∅; L_E^t(f_k) = ∅; L_Q^t(f_k) = ∅;
while (succ(v_i) ≠ ∅) and (|L_Q^t(f_k)| < |Q(P_c)|) {
    for each v_j ∈ succ(v_i) {
        p_k^t(e_ij) = [τ(e_ij)]^α·[σ(v_j)]^β / Σ_{v_l ∈ succ(v_i)} [τ(e_il)]^α·[σ(v_l)]^β;
        if match(a^c_(|L_Q|+1), ..., a^c_(|L_Q|+|λ(e_ij)|), λ(e_ij)):
            p_k^t(e_ij) = p_k^t(e_ij)·c_|λ(e_ij)|;
        // here, we compare all permutations in λ(e_ij)
        // with the amino acids a^c_(|L_Q|+1), ..., a^c_(|L_Q|+|λ(e_ij)|)
        add(p_k^t(e_i), p_k^t(e_ij));
    }
    normalize(p_k^t(e_i));
    e_ij = chooseEdge(p_k^t(e_i));
    add(L_V^t(f_k), v_j);
    add(L_E^t(f_k), e_ij);
    add(L_Q^t(f_k), λ(e_ij));
    i ← j;
}
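
The transition rule can be sketched as below. The weighting
pheromone^alpha times score^beta is the usual ACO form and is assumed
here; the bonus table indexed by label length, the function names and
the example values are likewise hypothetical.

```python
import random


def transition_probabilities(successors, alpha, beta, next_residues, bonus):
    """successors: list of (label, tau, sigma) for the edges leaving v_i.

    next_residues: the part of the database sequence not yet covered.
    bonus: dict mapping a label length to its multiplicative constant.
    """
    weights = []
    for label, tau, sigma in successors:
        weight = (tau ** alpha) * (sigma ** beta)
        expected = next_residues[:len(label)]
        # Any permutation of the edge label may match the next residues.
        if sorted(label) == sorted(expected):
            weight *= bonus[len(label)]
        weights.append(weight)
    total = sum(weights)
    return [w / total for w in weights]


def choose_edge(successors, probabilities, rng):
    """Draw one successor edge according to its probability."""
    return rng.choices(successors, weights=probabilities, k=1)[0]


successors = [("G", 1.0, 0.5), ("AS", 1.0, 0.5), ("V", 1.0, 0.5)]
probs = transition_probabilities(successors, alpha=1.0, beta=1.0,
                                 next_residues="SAK", bonus={1: 2.0, 2: 4.0})
chosen = choose_edge(successors, probs, random.Random(0))
```

The double edge "AS" matches the next two residues "SA" in either
order, so it receives the bonus and becomes the most likely choice,
while the other edges keep a non-zero probability.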



III. 2b) Scoring the ants 



At the end of each iteration t, one must evaluate the similarity
between the current peptide P_c and the different paths used by the
ants. Each ant gets a final score S^t(f_k) depending on its path
L_E^t(f_k). The goal is to include in S^t(f_k) all possibly relevant
information from different sources (see equation 5). For example, in
order to take into account information coming from S_int we can use the
intensity of the peaks, stored in i°(v_i), v_i ∈ L_V^t(f_k), and compute
an intensity score intS. From the ionic hypothesis set, we can build a
relevancy score relS, expressing the relevancy of the vertices parsed
by f_k. The current peptide sequence can be used in a covS score that
would express the similarity between the peptide sequence Q(P_c) and
the sequence L_Q^t(f_k) built by the ant. The quality of the
correlation between the b-masses of the used vertices and the
theoretical masses expected from Q(P_c) can also be taken into account
as a regression score called regS. Still other information can be
added, such as rules resulting from the expertise of biologists used to
studying MS/MS data.

Equation 5: $S^t(f_k) = f(\mathrm{intS}, \mathrm{relS}, \mathrm{covS}, \mathrm{regS}, \ldots)$

The next sections show implementation examples of the sub-scores intS, 
relS, covS and regS used in our current algorithm. 

The coverage score covS represents the sequence similarity between the
current peptide P_c and the sequence built by an ant f_k. It is computed
with an alignment function, for example a Smith and Waterman
algorithm. Given Q(P_c) and L_Q^t(f_k):

Algorithm 7: Coverage score
covS = align(Q(P_c), L_Q^t(f_k));

The relevancy score is the mean of the scores of the used vertices. It
is computed as shown in equation 6.

Equation 6: $\mathrm{relS} = \frac{\sum_{v_i \in L_V^t(f_k)} \sigma(v_i)}{|L_V^t(f_k)|}$

Similarly, the intensity score is computed as follows:

Equation 7: $\mathrm{intS} = \frac{\sum_{v_i \in L_V^t(f_k)} i^{o}(v_i)}{|L_V^t(f_k)|}$

The regression score measures the global correspondence between the
experimental masses μ°(v_i) of the vertices included in the ant's path
and the corresponding theoretical masses
R(P_c) = {r_1, r_2, ..., r_|R(P_c)|} computed from the current database
peptide sequence Q(P_c) (Gras et al., 2000). The relation between these
masses is first plotted on a graph, with the experimental masses as
abscissa and the theoretical masses as ordinate, and the set of points
is used to calculate a linear regression. The mean of the deviation
between the points and the linear regression represents the regression
score regS.

Given y = ax+b, the linear regression, μ°(v_i) ∈ L_V^t(f_k) the
experimental masses and their corresponding theoretical masses
r_i ∈ R(P_c):



Algorithm 8: Computation of regS
For each μ°(v_i) ∈ L_V^t(f_k) {
    add(R, μ°(v_i), Q(P_c));  // compute the corresponding theoretical
                              // mass r_i and add it to R
}
linearReg(a, b, R, L_V^t(f_k));  // this function makes the regression
regS = Σ_i |a·μ°(v_i) + b - r_i| / |L_V^t(f_k)|
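
The regression score can be sketched with an ordinary least-squares
fit. Taking the deviation as the absolute residual is an assumption, as
are the function names and the example masses; the sketch only shows
that a pure calibration offset leaves the score at zero while a single
outlying mass raises it.

```python
from statistics import mean


def linear_regression(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def regression_score(experimental, theoretical):
    """Mean absolute deviation of the points from the fitted line."""
    a, b = linear_regression(experimental, theoretical)
    return mean(abs(a * x + b - y) for x, y in zip(experimental, theoretical))


experimental = [100.0, 200.0, 300.0]
offset_only = regression_score(experimental, [100.5, 200.5, 300.5])
one_outlier = regression_score(experimental, [100.0, 201.0, 300.0])
```

Because the fitted line absorbs any linear miscalibration, the score
reacts only to masses that are inconsistent with the rest of the path.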
EXPERIMENTAL EXAMPLE

A preliminary implementation of our algorithm has been tested on a
training set of MS/MS spectra (only complete paths, no unknown
modifications). 92.1% of 101 spectra were well identified. Here are
some result examples.

MSMS file                  : DSNNLXLHFNPR.dta
Peaks used/tot             : 56 / 935
Parent_mass (M/H+)/charge  : 1485.63 / 2
Vertices                   : 170
Edges (simple/double)      : 482 / 4345
Ants nb / Iter nb          : 101 / 5

#  | s_n* | fin_s** | access | id         | sequence_dtb*** / sequence_graph****
1. | 0    | 1.396   | P09382 | LEG1_HUMAN | DSNNLCLHFNPR / SdNNLXLHFNPR
2. | 0    | 0.312   | Q05586 | NMZ1_HUMAN | FANYSIMNLQNR / ewNIsinmLPNR
3. | 0    | 0.252   | P09848 | LPH_HUMAN  | DPSNQEDVEAARR / rxLNQEvdaePR

*    s_n = start node
**   fin_s = final score
***  theoretical sequence read in the database
**** sequence parsed in the graph (uppercase = simple edge, lower case
     = double edge)



MSMS file                  : EFTNVYIK.dta
Peaks used/tot             : 40 / 260
Parent_mass (M/H+)/charge  : 1012.51 / 2
Vertices                   : 122
Edges (simple/double)      : 349 / 3153
Ants nb / Iter nb          : 74 / 5

#  | s_n | fin_s | access | id         | sequence_dtb / sequence_graph
1. | 0   | 1.970 | Q13310 | PAB4_HUMAN | EFTNVYIK / EFTNVYIK
1. |     | 1.970 | Q15097 | PAB2_HUMAN | EFTNVYIK / EFTNVYIK
1. |     | 1.970 | P11940 | PAB1_HUMAN | EFTNVYIK / EFTNVYIK
2. |     | 1.079 | P42694 | Y054_HUMAN | QDYEMALK / QDeyaoLK
3. |     | 0.677 | P46821 | MAPB_HUMAN | LKHLDFLK / LKlhdfLK



MSMS file                  : EQIVPKPEEEVAQK.dta
Peaks used/tot             : 64 / 317
Parent_mass (M/H+)/charge  : 1622.83 / 3
Vertices                   : 194
Edges (simple/double)      : 579 / 4566
Ants nb / Iter nb          : 120 / 5

#  | s_n | fin_s | access | id         | sequence_dtb / sequence_graph
1. |     | 1.374 | P18621 | RL17_HUMAN | EQIVPKPEEEVAQK / qeviPKPEEEVAQK
2. |     | 0.396 | P36383 | CXA7_HUMAN | LLEEIHNHSTFVGK / LLEEvkCHSvzVG
3. |     | 0.394 | P16991 | YB1_HUMAN  | RPENPKPQDGKETK / RPtdPKPQvxgiQK
CLAIMS

1. A peptide identification method comprising the following steps:
(a) Performing tandem mass spectrometry on a sample containing
    one or more protein or peptide,
(b) Reducing the resulting spectrum to a peak list.
(c) Listing possible interpretations for said peak list into an
    interpreted peak list, taking into account physico-chemical
    knowledge.
(d) Structuring said interpreted peak list into a structured
    representation taking into account biological knowledge and
    preserving at least the following information:
        Mass/charge ratio of the peaks obtained in step (b)
        Mass/charge ratio of the parent peptide
        Charge of the parent peptide
        Intensity of the peaks
(e) Matching said structured representation with a biological
    sequence database prior to any reduction of the structured
    information into one or a limited number of amino acid
    sequences.
(f) Determining the best peptide match or matches within said
    database.

2. A protein identification method comprising steps (a) to (f) of
claim 1, and further comprising a step (g) consisting in using the
peptide matching information of step (f) for identification of the
corresponding protein or proteins in the protein database.

3. The method of claim 1 or 2 wherein the structured representation
of step (d) consists in a graph wherein:
    Vertices of the graph represent individual elements of the
    interpreted peak list, translated into potential b-ion type
    peptide fragments.
    Edges link vertices representing said b-ion type peptide
    fragments whose molecular weights differ by a value
    equivalent to the molecular weight of one or more amino
    acids.

4. The method of any one of claims 1 to 3 wherein the matching of
step (e) consists in successively parsing the structured representation
of step (d) according to each database sequence, each parsing leading
to a score correlating each database sequence to the structured
representation.

5. The method of claim 4 wherein the parsing is performed by a Swarm 
Intelligence Algorithm. 


6. The method of claim 5 wherein the Swarm Intelligence algorithm 
is an Ant Colony Optimization algorithm. 
7. The method of any one of claims 3 to 6 wherein non-linked relevant
sets of successive edges are combined together according to a
modification hypothesis.

8. A computer-readable medium comprising instructions for causing a
computer linked to one or several mass spectrometers and to one or more
biological sequence databases to perform the steps of the method of
any one of claims 1 to 7.

9. A system comprising a computer linked to one or more mass
spectrometers and to one or more biological sequence databases, said
computer comprising a program for performing the steps of the method of
any one of claims 1 to 7.
[Figure, sheet 1/1: flowchart of the method. Protein or peptide sample
-> TANDEM MASS SPECTROMETRY -> MS/MS peak list (parent peptide m/z and
charge; for each peak: m/z, intensity) -> INTERPRETING (physico-chemical
knowledge: fragmentation energy levels, chemical notions, ...) ->
interpreted peak list (parent peptide mass; for each interpreted peak:
m/z, intensity, ionic hypothesis) -> STRUCTURING (biological knowledge:
amino acids, experiment parameters) -> graph or any other structured
representation -> COMPARING with a peptide sequence database (any source
of biological sequences, possible filter), giving one score per peptide
-> BEST PEPTIDE MATCH OR MATCHES.]